Last updated: 2019-10-21

Workshop Overview

We will cover:

  • A conceptual introduction to Bayesian analysis (+ MCMC sampling basics)
  • The brms package (with example analyses)
  • The Stan model language (if there's time)

I've tried to include links throughout that will be useful when you come to run your own Bayesian analyses

Intro Overview

  • Why go Bayesian?
  • Introduction to Bayesian analysis
    • What's the purpose of Bayesian estimation?
    • MCMC sampling
    • How good are my samples?
    • Posterior predictive checks
    • Model comparison
    • Summarizing the posterior

Why go Bayesian?

Why go Bayesian?

It tells you what you want to know.

Given the data (and our prior knowledge),

  • What interval contains an unobserved parameter with .95 (or some other) probability?
  • What's the relative weight of evidence for one model (e.g., a 'null') vs. another (e.g., an alternative)?

Why go Bayesian?

  • We almost always have some prior knowledge of reasonable parameter values - this should be incorporated
  • Results are not dependent on a 'sampling plan' (see, e.g., Kruschke & Liddell, 2018 for more)

Introduction to Bayesian analysis

Purpose

  • To re-allocate credibility over parameter values based on the observed data (Kruschke, 2015)
  • Given the observed data, what parameter values should we most strongly believe in?
  • To obtain this we need to start with a model and some prior expectation as to the probability of certain parameter values in this model (more on this later)

Bayesian estimation

  • \(\theta\) = parameter(s), \(y\) = observed data
  • \(p(y \mid \theta)\) = the likelihood
  • \(p(\theta)\) = the prior
  • \(p(y)\) = probability of the data ("the evidence")

\[ p(\theta \mid y) = \frac{p(\theta)p(y \mid \theta)}{p(y)} \]

Bayesian estimation

\(p(y)\) does not depend on the model parameters so we can omit it in favor of the unnormalized posterior

\[ p(\theta \mid y) \propto p(\theta)p(y \mid \theta) \]

The posterior is proportional to the prior times the likelihood
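For a single parameter this relationship can be made concrete with a grid approximation. The sketch below uses stand-in reading-time data (`Y` is simulated here; in the workshop it comes from the example data set) and, for simplicity, fixes \(\sigma = 10\) so only \(\mu\) is estimated:

```r
# Grid approximation: posterior is proportional to prior times likelihood
# (Y is simulated stand-in data; sigma fixed at 10 for simplicity)
set.seed(1)
Y <- rnorm(30, mean = 42, sd = 9)            # stand-in reading times

mu_grid    <- seq(20, 60, length.out = 501)  # candidate values for mu
prior      <- dnorm(mu_grid, mean = 40, sd = 10)
likelihood <- sapply(mu_grid, function(m) prod(dnorm(Y, mean = m, sd = 10)))

unnorm_post <- prior * likelihood            # the unnormalized posterior
posterior   <- unnorm_post / sum(unnorm_post)  # normalize over the grid

sum(mu_grid * posterior)  # posterior mean: a compromise between prior and data
```

Grid approximation only works for a handful of parameters, which is why MCMC sampling (later in these slides) is needed for realistic models.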

Likelihood

\(p(y \mid \theta)\)

Likelihood

Some fake data we'll assume comes from a normal distribution. Think of it as time (in seconds) to read a short passage of text for 30 individuals.

Goal is to estimate mean reading time (\(\mu\)) and variability (\(\sigma\))

Likelihood

For this example we can assume the observations are independent and \(f\) is the normal pdf

\[ p(y \mid \theta) = \prod_{i=1}^{n} f(y_i \mid \theta) \]

For an introduction, see Etz (2018)

Likelihood

In R:

# likelihood of the data for mu = 5 and sd = 10
prod(dnorm(x = Y, mean = 5, sd = 10)) # or exp(sum(dnorm(Y, 5, 10, log = TRUE)))
## [1] 1.820996e-79

If we were doing maximum likelihood estimation we could use optim to search for the parameters that maximize the above (or minimize the negative log likelihood).
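A minimal sketch of that maximum likelihood approach (with simulated stand-in data for `Y`; the standard deviation is optimized on the log scale so it stays positive):

```r
# Maximum likelihood via optim: minimize the negative log likelihood
# (Y is simulated stand-in data)
set.seed(1)
Y <- rnorm(30, mean = 42, sd = 9)

negloglik <- function(par) {
  # par[2] is log(sd), so the sd passed to dnorm is always positive
  -sum(dnorm(Y, mean = par[1], sd = exp(par[2]), log = TRUE))
}

fit <- optim(par = c(mean = 30, log_sd = 1), fn = negloglik)
c(mu = fit$par[[1]], sigma = exp(fit$par[[2]]))  # near mean(Y) and sd(Y)
```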

For Bayesian estimation we use the likelihood to update our prior beliefs in different parameter values.

Likelihood

Prior

\(p(\theta)\)

Prior

  • What are our expectations for parameters before seeing the data?
  • We will try to use "weakly informative priors" - one possible definition below (from here)

"Weakly informative prior should contain enough information to regularize: the idea is that the prior rules out unreasonable parameter values but is not so strong as to rule out values that might make sense"

Prior

Going back to our reading time example, say we have a fairly good idea of the reading rate (words per second) of the general population:

  • For the passage we gave participants we expect an average time of 40 s, but it could range from 20 to 60 s
  • Here we might choose a Normal(mean = 40, sd = 10) prior for \(\mu\)

Prior

For the standard deviation of reading times (\(\sigma\)), we might be less certain:

  • We expect people to vary on average around 10s, but we can't rule out larger values
  • A reasonable choice would be a half-Cauchy(scale = 10) prior on \(\sigma\)
  • Other possible choices: gamma, uniform, \(t\), see Gelman (2006) for discussion
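One quick way to check whether these priors say what we intend is to draw from them (half-Cauchy draws can be obtained by taking absolute values of Cauchy draws):

```r
# Draws from the priors above: Normal(40, 10) for mu and half-Cauchy(10) for sigma
set.seed(1)
mu_prior    <- rnorm(1e5, mean = 40, sd = 10)
sigma_prior <- abs(rcauchy(1e5, location = 0, scale = 10))

quantile(mu_prior, c(.025, .975))   # roughly 20 to 60, as intended
median(sigma_prior)                 # around 10, with a heavy right tail
```

The heavy tail of the half-Cauchy is what lets us "not rule out larger values" of \(\sigma\).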

Prior

We can specify the priors separately

Prior

But they determine a joint distribution

Bayesian Estimation

Prior

Another example, linear regression:

\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i \]

\[ \epsilon_i \sim \mbox{Normal}(0, \sigma) \]

Or

\[ y_i \sim \mbox{Normal}(\beta_0 + \beta_1x_i, \sigma) \]

Prior

If \(x\) and \(y\) have been scaled (\(z\)-scored), a reasonable choice would be:

  • \(\beta_0 \sim \mbox{Normal}(0, 3)\)
  • \(\beta_1 \sim \mbox{Normal}(0, 3)\)
  • \(\sigma \sim \mbox{half-Cauchy}(2.5)\)

Prior

These priors essentially say that we expect either a positive or negative relationship between \(x\) and \(y\).

If we had strong reason to expect that \(y\) should increase with \(x\) we could instead use:

  • \(\beta_1 \sim \mbox{Normal}^{+}(0, 3)\) (the "+" means truncated to positive values; equivalent to a half- or folded-normal)

Prior - correlation

  • For lme models with correlated random effects, we'll need a prior for the correlation matrix (in other work you might see people put a prior on the covariance matrix)
  • brms and Stan use the LKJ prior (after Lewandowski, Kurowicka, & Joe, 2009)
  • It has one parameter, \(\eta\)
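For a \(2 \times 2\) correlation matrix the LKJ(\(\eta\)) prior reduces to a density on the single correlation \(\rho\) proportional to \((1 - \rho^2)^{\eta - 1}\), which makes \(\eta\)'s role easy to see. A small sketch (normalizing over a grid just for comparison):

```r
# LKJ(eta) for a 2x2 correlation matrix: density proportional to (1 - rho^2)^(eta - 1)
rho <- seq(-0.99, 0.99, length.out = 199)

lkj_2x2 <- function(rho, eta) {
  d <- (1 - rho^2)^(eta - 1)
  d / sum(d)          # normalize over the grid so curves are comparable
}

# eta = 1: flat over correlations; eta > 1: mass concentrated near 0;
# eta < 1: mass pushed towards -1 and 1
plot(rho, lkj_2x2(rho, eta = 2), type = "l")
lines(rho, lkj_2x2(rho, eta = 1), lty = 2)
```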

Prior - correlation

Aside on lme4 convergence

  • For models with lots of random effects lme4 convergence can be an issue
  • This usually isn't an lme4 problem - the model is too complex to be supported by the data
  • With the additional regularizing information in the prior, this should not be an issue in brms/Stan (see, e.g., this blog)
  • So it should be possible to 'keep it maximal'
  • Some still suggest that random effects structure should be simplified on grounds of parsimony (e.g., remove if the 95% credible interval for a random SD or correlation includes zero)

Prior - Summary

  • Setting priors can be tricky (slide after next has useful links)
  • When estimating parameters the main thing is to not be too restrictive (you want to let the data 'speak' and not rule out certain values without good reason)
  • For complex models (e.g., those with transformations of parameters), assessing whether the prior actually expresses what we expect can be difficult
  • It can be useful to look at the prior predictive distribution, which is essentially many simulated data sets using parameters drawn from the prior (extra slides on this at the end)
  • Typically, we will have enough data to 'overwhelm' the prior
  • If you are worried that you do not, you can do sensitivity analyses (i.e., check how much your conclusions depend on the prior)

Prior - Summary

  • For model comparison and particularly hypothesis testing with Bayes factors priors should be selected much more carefully
  • The different models you are comparing are defined by their priors
  • So they should reflect reasonable alternatives
  • For example, comparing a model with a parameter (e.g., mean difference) to one without, if the prior on the parameter is very diffuse (e.g., a normal with large SD) you will likely get strong evidence for the null

Useful papers/ links on priors

MCMC Sampling

Why sample?

  • The posterior, \(p(\theta \mid y)\), is a distribution but the shape of that distribution is not always directly attainable (i.e., no analytic expression).
  • In these situations sampling is needed to approximate the posterior distribution. This is what is offered by software like JAGS, BUGS, Stan etc.

MCMC

Markov Chain Monte Carlo

  • Markov Chain - a "memoryless" chain of events. Each step depends on the current state (e.g. parameter value) and not previous ones
  • Monte Carlo - Repeated random sampling

Goal of MCMC - to approximate a target distribution

MCMC

  • The mcmc slides introduce the basics of MCMC in Bayesian analyses via a simple Metropolis-Hastings algorithm (see metropolis.example.R)
  • The important points are:
    • MCMC generates a chain of samples
    • Once the chain has converged on a stable distribution, parameter values are visited in proportion to their posterior probability/density
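The core idea fits in a few lines of R. This is an illustrative stand-in, not the workshop's metropolis.example.R: it samples \(\mu\) for the reading-time example (with `Y` simulated here and \(\sigma\) fixed at 10), accepting each proposal with probability min(1, posterior ratio):

```r
# A minimal Metropolis sampler for mu (sigma fixed at 10); Y is stand-in data
set.seed(1)
Y <- rnorm(30, mean = 42, sd = 9)

log_post <- function(mu) {   # log prior + log likelihood (unnormalized posterior)
  dnorm(mu, 40, 10, log = TRUE) + sum(dnorm(Y, mu, 10, log = TRUE))
}

n_iter   <- 5000
chain    <- numeric(n_iter)
chain[1] <- 30                                        # starting value
for (i in 2:n_iter) {
  proposal <- rnorm(1, mean = chain[i - 1], sd = 2)   # symmetric proposal
  # accept with probability min(1, posterior ratio); work on the log scale
  if (log(runif(1)) < log_post(proposal) - log_post(chain[i - 1])) {
    chain[i] <- proposal
  } else {
    chain[i] <- chain[i - 1]                          # stay put
  }
}
mean(chain[-(1:1000)])   # posterior mean for mu, after discarding burn in
```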

How accurate is the MCMC chain?

Things to consider

  • Burn in (or warm up)
  • Auto-correlation
  • Effective Sample Size (ESS)
  • Convergence and the Potential Scale Reduction Factor (PSRF)
  • Thinning

Burn in (or warm up)

Note: Warm up for brms/Stan is more complicated and serves to tune the sampling parameters

Autocorrelation

Autocorrelation

Effective Sample Size

A way of estimating the number of independent samples once accounting for auto-correlation:

\[ ESS = \frac{N}{1 + 2\sum_{k = 1}^{\infty} \rho_k} \]

Where \(\rho_k\) is the auto-correlation at lag \(k\). Think of this as dividing the number of samples by the amount of auto-correlation. In practice the sum stops when the auto-correlation is small (e.g. \(\rho_k < 0.05\); see Kruschke, 2015, p. 184).
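A rough version of this calculation in base R (truncating the sum once \(\rho_k\) drops below the cutoff, as described above; the AR(1) chain is a stand-in for an autocorrelated MCMC chain):

```r
# Rough ESS: N divided by 1 + 2 * (sum of autocorrelations until rho_k < 0.05)
ess <- function(chain, cutoff = 0.05) {
  N    <- length(chain)
  rho  <- acf(chain, lag.max = N - 1, plot = FALSE)$acf[-1]  # drop lag 0
  keep <- which(rho < cutoff)[1] - 1     # lags before rho_k falls below cutoff
  if (is.na(keep) || keep < 1) return(N)
  N / (1 + 2 * sum(rho[seq_len(keep)]))
}

set.seed(1)
iid <- rnorm(1000)                                       # independent draws
ar1 <- as.numeric(arima.sim(list(ar = 0.9), n = 1000))   # autocorrelated draws
ess(iid)   # near the nominal 1000
ess(ar1)   # far fewer effective samples
```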

Convergence

How do we know that we have converged onto a stable distribution?

Convergence

  • Run multiple sequences (chains) with different starting points
  • Compare variation between different chains to variation within chains. When these are approximately equal we can claim to have converged on the target distribution
  • This is measured via \(\hat{R}\) and conventionally a value \(< 1.1\) is considered converged
  • \(\hat{R}\) is also referred to as the Gelman-Rubin convergence diagnostic or the Potential Scale Reduction Factor (PSRF)
  • The \(\hat{R}\) calculated by brms/Stan is slightly different (it also compares the beginning and end of the same chain; see Gelman et al., 2014, pages 284-286)
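A basic (non-split) version of this between/within comparison can be written directly from the Gelman-Rubin definition; this sketch takes a matrix of chains rather than a real MCMC fit:

```r
# Basic potential scale reduction factor for a set of chains
# (brms/Stan's version additionally splits each chain in half)
rhat <- function(chains) {          # chains: iterations x chains matrix
  n <- nrow(chains)
  B <- n * var(colMeans(chains))    # between-chain variance
  W <- mean(apply(chains, 2, var))  # within-chain variance
  var_plus <- ((n - 1) / n) * W + B / n
  sqrt(var_plus / W)
}

set.seed(1)
good <- replicate(4, rnorm(1000))               # 4 chains, same target
bad  <- cbind(rnorm(1000, 0), rnorm(1000, 5))   # chains stuck in different places
rhat(good)   # about 1: between and within variation agree
rhat(bad)    # well above 1.1: no convergence
```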

Convergence

Thinning

Thinning

Thinning

  • If auto-correlation is really bad thinning might help, but it might suggest deeper problems with your model (see Gelman et al., 2014)
  • It has been claimed that thinning is "often unnecessary and always inefficient" (Link & Eaton, 2012)
  • Often it is better to keep the full chains

Posterior Predictive Checking

Posterior Predictive Checking

  • PPCs are a way of evaluating the performance of a particular model and identifying potential areas/sources of misfit
  • Involves simulating outcomes (\(y_{\mbox{rep}}\)) from the model
  • Incorporating uncertainty in the estimated parameters

Posterior Predictive Checking

For each step in the chain (or a random subset of the chain), simulate \(N\) observations from the model with parameters set to the current step in the chain (where \(N\) is the size of the original data set)

We can compare these simulated outcomes to the observed data
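A sketch of that procedure by hand. The "posterior draws" below are made up for illustration (with brms you would use posterior draws from a fitted model, e.g. via pp_check()):

```r
# Posterior predictive simulation: for each draw (mu_s, sigma_s), simulate
# a data set the same size as the original (draws here are invented)
set.seed(1)
Y     <- rnorm(30, mean = 42, sd = 9)            # observed data (stand-in)
draws <- data.frame(mu    = rnorm(200, 42, 1.7), # pretend posterior draws
                    sigma = abs(rnorm(200, 9, 1.2)))

y_rep <- sapply(seq_len(nrow(draws)), function(s) {
  rnorm(length(Y), mean = draws$mu[s], sd = draws$sigma[s])
})                                               # 30 x 200 matrix

# e.g., overlay densities of simulated data sets on the observed data
plot(density(Y), lwd = 2)
for (s in 1:20) lines(density(y_rep[, s]), col = "grey")
```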

Posterior Predictive Checking

Posterior Predictive Checking

  • We can also calculate specific quantities (e.g., max, min, mean) for each posterior predictive sample and compare to those from the observed data set
  • If a particular statistic (\(T\)) from the observed data is rare under the model predictions, this is indicative of misfit

Posterior Predictive Checking

Posterior Predictive Checking

Model Comparison

Model Comparison

  • Out-of-sample prediction accuracy
    • Approximate leave-one-out cross validation (LOO)
    • Widely Applicable Information Criterion (WAIC)
  • Marginal likelihood
    • Bayes factor
    • Posterior model probability

LOO and WAIC

Both try to estimate the expected log predictive density (elpd) for new data

  • LOO can be approximated by importance sampling, specifically we will use Pareto smoothed importance sampling (PSIS) as implemented in the loo package (see Vehtari et al., 2017 for details)
  • WAIC is a Bayesian extension of AIC (Watanabe, 2010). It essentially estimates the effective number of parameters of the model and uses this to penalize its predictive accuracy

Larger elpd is better but note that LOO and WAIC are often reported on deviance scale (multiplied by -2), in which case smaller values indicate better fit.
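To make the WAIC quantities concrete, here is a by-hand version computed from a matrix of pointwise log likelihoods (draws × observations), following the formulas in Vehtari et al. (2017); in practice you would call loo::waic() or loo::loo() on a brms fit. The log-likelihood matrix below is a toy example:

```r
# WAIC by hand from a (draws x observations) log likelihood matrix
waic_by_hand <- function(log_lik) {
  # lppd: log of the mean (over draws) likelihood, summed over observations
  lppd   <- sum(log(colMeans(exp(log_lik))))
  p_waic <- sum(apply(log_lik, 2, var))   # effective number of parameters
  c(elpd_waic = lppd - p_waic, waic = -2 * (lppd - p_waic))
}

# toy log-lik matrix: 1000 pretend posterior draws for mu, 30 observations
set.seed(1)
Y  <- rnorm(30, 42, 9)
mu <- rnorm(1000, 42, 1.7)
log_lik <- sapply(Y, function(y) dnorm(y, mean = mu, sd = 9, log = TRUE))
waic_by_hand(log_lik)   # elpd scale and deviance (-2 * elpd) scale
```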

LOO and WAIC

  • Asymptotically, WAIC and LOO should give the same results, although LOO is advocated more strongly where possible (see Vehtari et al., 2017)
  • When not possible (as we might see in some brms examples), \(K\)-fold cross validation can be used

Bayes factors

\[ \frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \frac{p(y \mid M_1)}{p(y \mid M_2)} \times \frac{p(M_1)}{p(M_2)} \]

\[ \frac{p(M_1 \mid y)}{p(M_2 \mid y)} = B_{1,2} \times \frac{p(M_1)}{p(M_2)} \]

  • The Bayes factor, \(B\), is the 'updating factor'
  • How much does our belief in model 1 over model 2 change, having seen the data?
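The arithmetic of that updating step, with illustrative numbers (equal prior odds and a hypothetical Bayes factor of 6):

```r
# Posterior odds = Bayes factor x prior odds
prior_odds     <- 1        # p(M1) / p(M2): no initial preference
bf_12          <- 6        # hypothetical Bayes factor for M1 over M2
posterior_odds <- bf_12 * prior_odds
posterior_prob <- posterior_odds / (1 + posterior_odds)   # p(M1 | y)
posterior_prob   # 6/7, about 0.86
```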

Marginal Likelihood

\[ p(y \mid M_i) = \int_{\theta_i} p(y \mid \theta_i, M_i) \, p(\theta_i \mid M_i) \, d\theta_i \]

  • 😨
  • We will use bridge sampling to estimate the marginal likelihood (see Gronau et al., 2017 for an introduction)
  • There are other approaches such as 'transdimensional MCMC' or the JZS 'default' Bayes factors implemented in the BayesFactor package (only normal models)

Summarizing the Posterior

Summarizing the Posterior

  • Each sample in the chain is a point in the joint parameter space
  • For inference, we'll focus on the marginal distribution of each parameter of interest
  • Usually we'll be interested in the mean/median and quantiles of the posterior

Credible interval and highest density interval

You will see both of these around…

  • 95% Credible Interval: the 2.5% to 97.5% quantiles (output by default in brms)
  • 95% Highest Density Interval: an interval containing 95% of the posterior mass such that values contained within the interval have higher posterior density than values outside the interval (can use HDInterval::hdi() to calculate)
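The difference shows up for skewed posteriors. The sketch below computes both from the same draws, with the HDI found by hand as the shortest interval containing 95% of the sorted draws (HDInterval::hdi() does the same thing):

```r
# Equal-tailed 95% CI vs 95% HDI, for a skewed posterior
set.seed(1)
draws <- rgamma(1e5, shape = 2, rate = 0.5)   # a skewed "posterior"

ci <- quantile(draws, c(.025, .975))          # equal-tailed interval

hdi_by_hand <- function(x, prob = 0.95) {
  x      <- sort(x)
  n_in   <- floor(prob * length(x))           # draws each candidate interval holds
  widths <- x[(n_in + 1):length(x)] - x[1:(length(x) - n_in)]
  lo     <- which.min(widths)                 # shortest such interval
  c(x[lo], x[lo + n_in])
}
hdi <- hdi_by_hand(draws)

diff(hdi) < diff(ci)   # the HDI is the shorter interval, shifted toward the mode
```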

Credible interval and highest density interval

Credible interval and highest density interval

Assessing null values with one model fit

ROPE

Region Of Practical Equivalence

  • Requires setting a boundary around some value (e.g., zero)
  • Values within the boundary are considered "practically equivalent" to the chosen value
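Once the boundary is set, the check is just the share of posterior draws falling inside it. A sketch with made-up posterior draws and a hypothetical ROPE of \([-0.1, 0.1]\) (in standardized units):

```r
# Proportion of the posterior inside a region of practical equivalence
set.seed(1)
beta_draws <- rnorm(1e4, mean = 0.4, sd = 0.15)   # pretend posterior for beta

rope      <- c(-0.1, 0.1)                         # hypothetical ROPE around zero
p_in_rope <- mean(beta_draws > rope[1] & beta_draws < rope[2])
p_in_rope   # small: little posterior mass is practically equivalent to zero
```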

ROPE

The Savage-Dickey Bayes factor

  • AKA the Savage-Dickey density ratio (see Wagenmakers et al., 2010 for an introduction)
  • For a point hypothesis regarding a parameter value, we compare the height of the posterior distribution to the height of the prior distribution at that particular value
  • The relative height of the posterior and prior tells us how much our belief in that particular value has changed after seeing the data
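A sketch of the density ratio with made-up posterior draws and a Normal(0, 1) prior (brms's hypothesis() function uses this idea for point hypotheses when the model is fit with sample_prior = TRUE):

```r
# Savage-Dickey: BF01 = posterior density at 0 / prior density at 0
set.seed(1)
prior_sd   <- 1
post_draws <- rnorm(1e5, mean = 0.8, sd = 0.3)   # pretend posterior draws

d            <- density(post_draws)               # kernel density estimate
post_dens_0  <- approx(d$x, d$y, xout = 0)$y      # posterior density at 0
prior_dens_0 <- dnorm(0, mean = 0, sd = prior_sd) # prior density at 0

bf_01 <- post_dens_0 / prior_dens_0
bf_01   # < 1: the data have shifted belief away from the null value
```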

The Savage-Dickey Bayes factor

The Savage-Dickey Bayes factor

Summary

Steps we'll follow in our brms examples:

  1. Figure out what model is appropriate for the data at hand
  2. Specify reasonable priors for model parameters (need to be especially careful if you want Bayes factors)
  3. Fit model (using brms, Stan, etc) and ensure chains have converged (we'll cover other possible problems/errors specific to Stan)
  4. Posterior predictive plots - are there areas for improvement?
  5. Refine model, compare competing models, …
  6. Examine posterior quantities (mean, median, CI, HDI)

End of introduction to Bayesian analysis

Extra slides

Prior Predictive Distribution

  • When setting priors for complex models it can be useful to look at the distribution of data implied by the prior distribution
  • We can simulate data from draws from the prior distribution
  • Do these predictions fall within a reasonable range? Domain expertise needed here

The next slides show examples of simulated data for the linear regression example (\(y_i \sim \mbox{Normal}(\beta_0 + \beta_1x_i, \sigma)\)) - with 100 observations of \(x = 0\) and 100 of \(x = 1\)
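The simulation behind those plots can be sketched as: draw \(\beta_0\), \(\beta_1\), and \(\sigma\) from the priors given earlier, then simulate one fake data set per draw (this is a stand-in, not the exact code used for the slides):

```r
# Prior predictive simulation for the regression example:
# priors beta0, beta1 ~ Normal(0, 3), sigma ~ half-Cauchy(2.5)
set.seed(1)
x <- rep(0:1, each = 100)                # 100 observations per condition

prior_pred_once <- function() {
  b0    <- rnorm(1, 0, 3)
  b1    <- rnorm(1, 0, 3)
  sigma <- abs(rcauchy(1, 0, 2.5))       # half-Cauchy(2.5) via |Cauchy|
  rnorm(length(x), mean = b0 + b1 * x, sd = sigma)
}

sims <- replicate(100, prior_pred_once())   # 200 x 100 matrix of fake data sets
range(sims)   # are these plausible values for z-scored data?
```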

Prior Predictive Distribution

Prior Predictive Distribution

Prior Predictive Distribution

Prior Predictive Distribution

  • Do these plots look reasonable? Note we could (and should) have looked at other quantities of the simulated data (e.g., min, max, condition differences)
  • Do they strike the balance of not putting too much prior weight on unlikely (in your expert opinion) outcomes while not being overly restrictive?
  • For psychologists, we'll usually have enough data to 'overwhelm' mildly informative priors